Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make language names unique #149

Merged
merged 8 commits into from
May 16, 2024
Merged

Make language names unique #149

merged 8 commits into from
May 16, 2024

Conversation

simoncozens
Copy link
Contributor

@simoncozens simoncozens commented May 14, 2024

This improves CI testing for textproto parseability, and the uniqueness of language names. I've noticed that we have a system for disambiguating language names: "<Region>ian <Language>" where there are regional variants, and "<Language>, <Script>" where the language is written in multiple scripts; so I've tried to extend this to the cases which are currently doubling up. Additionally for Baka there are two different, unrelated languages with the same name, so I have disambiguated them with "Baka (<Region>)".

I've also removed mlt_Latn (Maltese) because we have a better mt_Latn. @moyogo, does that sound fair?

@moyogo
Copy link
Contributor

moyogo commented May 14, 2024

This seems good to me.
I'm a bit confused by "Bassari (Senegal)" and "Guinean Bassari".

"Baka (South Sudan/Congo)" should probably be with "DRC" or "Congo-Kinshasa" as "Congo" is ambiguous. Should multiple countries be in a specific order?

names = Counter([lang.name for lang in LANGUAGES.values()])
if any(count > 1 for count in names.values()):
duplicates = {name: count for name, count in names.items() if count > 1}
pytest.fail(f"Duplicate language names: {duplicates}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: much nicer to read sorted



def test_language_uniqueness():
names = Counter([lang.name for lang in LANGUAGES.values()])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When present, I think you want to use lang.preferredName instead of lang.name.

That said, you might consider using that preferredName field in some of the language proto files in this PR. I don't actually see specific examples where that would make more sense, but worth considering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy, that found extra duplicates. :-)

@simoncozens simoncozens merged commit 9b4d8d2 into main May 16, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants